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Abstract 



O , 

■ We initiate the study of sparse recovery problems under the Earth-Mover Distance (EMD). 

Specifically, we design a distribution over m x n matrices A such that for any x, given Ax, we 
can recover a fc-sparse approximation to x under the EMD distance. One construction yields 
m = 0{klog{n/k)) and a 1 + e approximation factor, which matches the best achievable bound 
for other error measures, such as the £i norm. 

Our algorithms are obtained by exploiting novel connections to other problems and areas, 
such as streaming algorithms for k-median clustering and model-based compressive sensing. We 



^ \ also provide novel algorithms and results for the latter problems. 

CN ! 1 Introduction 
> 

. In recent years, a new "linear" approach for obtaining a succinct approximate representation of 

\ n-dimensional vectors (or signals) has been discovered. For any signal x, the representation is equal 

■ to Ax, where ^ is an m x n matrix, or possibly a random variable chosen from some distribution 

. over such matrices. The vector Ax is often referred to as the measurement vector or linear sketch 

of X. Although m is typically much smaller than n, the sketch Ax often contains plenty of useful 
information about the signal x. 

A particularly useful and well-studied problem is that of stable sparse recovery. The problem is 
typically defined as follows: for some norm parameters p and q and an approximation factor C > 0, 
given Ax, recover an "approximation" vector x* such that 



— a^*|L < C* min llx — x'll (1) 

^ A;-sparse x' 1 

where we say that x' is /c-sparse if it has at most k non-zero coordinates. Sparse recovery has 
applications to numerous areas such as data stream computing |Mut05t IInd07] and compressed 
sensing |CRT061 IDon06| , notably for constructing imaging systems that acquire images directly in 
compressed form (e.g., |DDT^08| IRom09] ). The problem has been a subject of extensive study 
over the last few years, with the goal of designing schemes that enjoy good "compression rate" (i.e., 
low values of m) as well as good algorithmic properties (i.e., low encoding and recovery times). 
It is known by no that there exist matrices A and associated recovery algorithms that produce 



*This research has been supported in part by the David and Lucille Packard Fellowship, MADALGO (the Center 
for Massive Data Algorithmics, funded by the Danish National Research Association) and NSF grant CCF-0728645. 
E. Price has been supported in part by an NSF Graduate Research Fellowship. 

^In particular, a random Gaussian matrix |CRT06| or a random sparse binary matrix ( [BGI"'"08] . building on 
[CCFC02I [CM04I [CM06] ') has this property with overwhelming probability. See [GllOj for an overview. 
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Figure 1: Our results 



approximations x* satisfying Equation ([T]) with ip = iq = ii, constant approximation factor C, 
and sketch length m = 0{klog{n/k)); it is also known that this sketch length is asymptotically 
optimal jDIPWlOj IFPRUlO] . Results for other combinations of £p/iq norms are known as well. 

However, limiting the error measures to variants of ip norms is quite inconvenient in many 
applications. First, the distances induced by £p norms are typically only quite raw approximations of 
the perceptual differences between images. As a result, in the field of computer vision, several more 
elaborate notions have been proposed (e.g., in [RTGOOt ILow04t Lyu05 , [GD05j ). Second, there are 



natural classes of images for which the distances induced by the £p norm are virtually meaningless. 
For example, consider images of "point clouds", e.g., obtained via astronomical imaging. If we 
are given two such images, where each point in the second image is obtained via small random 
translation of a point in the first image, then the ip distance between the images will be close to 
the largest possible, even though the images are quite similar to each other. 

Motivated by the above considerations, we initiate the study of sparse recovery under non- 
ip distances. In particular, we focus on the Earth-Mover Distance (EMD) [RTGOO] . Informally, 
for the case of two-dimensional A x A images (say, x,y : [A]-^ — )■ M+) which have the same ii 
norm, the EMD is defined as the cost of the min-cost flow that transforms x into y, where the 
cost of transporting a unit of mass from a pixel p £ [A]^ of x to a pixel q € [A]^ of y is equal 
to the ii distancqj between p and q. The EMD metric can be viewed as induced by a norm 
IMI_EAfD' such that EMD(x,y) = ||x — y\\EMD'^ ^ee Section [2] for a formal definition. Earth-Mover 
Distance and its variants are popular metrics for comparing similarity between images, feature sets, 
etc. IRTG001IGD05] . 

Results. In this paper we introduce three sparse recovery schemes for the Earth-Mover Dis- 
tance. Each scheme provides a matrix (or a distribution of matrices) A, with m rows and n columns 
for n = A-^, such that for any vector x, given Ax, one can reconstruct a vector x* such that 

\\x-x*\\emd<C mm ,\\x - x'\\ (2) 

fe-sparse x' 

for some approximation factor C > 0. We call any recovery scheme satisfying Equation ([2|) an 
EMD/EMD recovery scheme. If j4 is a distribution over matrices (that is, the scheme is randomized), 
the guarantee holds with some probability. The other parameters of the constructions are depicted 
in Figure [H 

In particular, two of our constructions yield sketch lengths m bounded by 0{klog{n/k)), which 
mimics the best possible bound achievable for sparse recovery in the ii distance |DIPW10] . Note, 
however, that we are not able to show a matching lower bound for the EMD case. 

Connections and applications What does sparse recovery with respect to the Earth-Mover 
Distance mean? Intuitively, a sparse approximation under EMD yields a short "signature" of 



^One can also use the £2 distance. Note that the two distances differ by at most a factor of \/2 for two-dimensional 
images. 
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the image x that approximately preserves its "EMD properties". For example, if x consists of a 
small number of sparse point clouds (e.g., as in astronomical imaging), sparse approximation of x 
will approximately identify the locations and weights of the clouds. Our preliminary experiments 
with a heuristic algorithm for such data [GlPlOj show that this approach can yield substantial 
improvements over the usual sparse recovery. 

Another application [RTGOO] stems from the original paper, where such short signatures were 
constructecil for general images, to extract their color or texture information. The images were then 
replaced by their signatures during the experiments, which significantly reduced the computation 
time. 

The above intuitions can be formalized as follows. Let x' be the minimizer of — x'\\^f^^^ over 
all fc-sparse vectors. Then one can observe that the non-zero entries of x' correspond to the cluster 
centers in the best A;-mediar0 clustering of x. Moreover, for each such center c, the value of x'^ is 
equal to the total weight of pixels in the cluster centered at c. Thus, a solution to the /c-median 
problem provides a solution to our sparse recovery problem as welH. 

There has been prior work on the fc-median problem in the streaming model under insertions and 
deletions of points |FS051 irnd04] . Such algorithms utilize linear sketches, and therefore implicitly 
provide schemes for approximating the /c-median of x from a linear sketch of x (although they do 
not necessarily provide the cluster weights, which are needed for the sparse recovery problem). Both 
algorithm^ yield a method for approximating the fc-median from Q{k'^ log^^^^ n) measurements, 
with the algorithm of |FS05j providing an approximation factor of 1 + e. In contrast, our result 
achieves an approximation factor of 1 + e with a sketch length m that is as low as 0{klog{n/k)). 

Thanks to this connection, our results also yield short sketches for the A;-median problem. 
Although the solution x* output by our algorithm does not have to be fc-sparse (i.e., we might 
output more than k medians), one can post-process the output by computing the best A:-sparse 
approximation to x* using any off-the-shelf (weighted) fc-median algorithm (e.g., |HPM04] )). This 
reduces the number of clusters to k, while (by the triangle inequality of EMD) multiplying the 
approximation factor by a constant that depends on the approximation constant of the chosen 
/c-median algorithm. See Appendix O for more details. 



Techniques On a high level, our approach is to reduce the sparse recovery problem under EMD 
to sparse recovery under £i. This is done by constructing a linear mapping P that maps M''^] 
into some space M*, that has the property that a "good" sparse approximation to y = Px under ii 
yields a "good" sparse approximation to x under EMD. The list of formal constraints that such 
a mapping needs to satisfy are given in Section [3l For concreteness, we define one such mapping 
below; another one is given in Section [71 Specifically, the pyramid mapping P |IT031 IGDOSj 
(building on |Cha021 1 AV99] ) is defined as follows. First we impose log A -|- 1 nested grids Gi on 

^In fact, the algorithm in [RTGOO] vaguely resembles our approach, in that it uses a kd-tree decomposition to 
partition the images. 

*For completeness, in our context the fc-median is defined as follows. First, each pixel p G [A]^ is interpreted as 
a point with weight Xp. The goal is to find a set C C [n]^ oi k "medians" that minimizes the objective function 
Epg[„]2mincgc||p-c||^a;p. 

^If the algorithm reports both the medians and the weights of clusters. 

^The paper Jnd04 claims m — fclog*^'^' n. Unfortunately, that is an error, caused by ignoring the dependencies 
between the queries and their answers provided by the randomized data structure MediEval. Fixing this problem 
requires reducing the probability of failure of the algorithm so that it is inversely exponential in k, which yields 
another factor of k in the space bound. 

^We note that the aforementioned k-median algorithms implicitly rely on some form of sparse recovery (e.g., see 
Remark 3.10 in |FS05| or remarks before Theorem 5 in jlnd04j). However, the bounds provided by those algorithms 
fall short of what we aim for. 
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[A]^, with G = U Gj. For each level i = . . . I, I = log2 A, the grid Gi is a partition of the image 
into cells of side length 2*. The cells in the grids can be thought of as forming a 4-ary tree, with 
each node c at level i having a set C(c) of children at level i — 1. For each i, we define a mapping 
Pi such that each entry in PiX corresponds to a cell c in Gi, and its value is equal to the sum of 
coordinates of x falling into c. The final mapping P is defined as 

Px = [2^Pox, 2^Pix, . . . , 2^Pix] (3) 

It is easy to see that, for a vector x that is /c-sparse, the vector Px is 0{K) sparse for K = kl. 
We also show that for any x, there exists an 0(i^)-sparse y such that the difference ||y — -Px||i 
is comparable to min/j_spaj.gg 2;/ ||x — "^'IlijMD' then find a good approximation x* to x (in the 
EMD norm) by "inverting" P on y. Since we can recover an 0(i<r)-sparse approximation to y (in 
the ii norm) from a sketch of length 0{Klog{n/K)), we obtain the first result from Figure [TJ 

To improve the sketch length we exploit the particular properties of the mapping P to recover 
an 0(i<')-sparse approximation from only 0{K) measurements. For any non-negative vector x, the 
coordinates of Px have the following hierarchical structure: (i) the coordinates are organized into 
an r-ary tree for r = 4, and (ii) the value of each internal node is non-negative and equal to the sum 
of its children times two. Using one or both of these properties enables us to reduce the number of 
measurements. 

The second algorithm from Figure [T] is obtained using the property (i) alone. Specifically, the 
problem of recovering a sparse approximation whose support forms a tree has been well-studied 
in signal processing (the question is motivated by an empirical observation that large wavelet 
coefficients tend to co-occur in this fashion). In particular, the insightful paper [BCDHIO] on 
model-based compressive sensing (see Section [5] for an overview) gave a deterministic scheme that 
recovers such approximation from a sketch of length 0{K). Although the setup given in that paper 
is somewhat different from what we need here, we show that one can modify and re-analyze their 
scheme to achieve the desired guarantee. This approach, however, leads to an approximation factor 
ofO(v/log(n/A;)). 

In order to achieve a constant approximation factor, we employ both properties (i) and (ii), 
as well as randomization. Specifically, we recover the tree coefficients top-down, starting from the 
root of the tree. This is done in a greedy manner: we only recurse on the children of nodes that are 
estimated to be "heavy" . This first pass identifies a superset S of the locations where Px is large, 
but estimates some of the values {Px)s quite poorly. The set of locations S has |5| = 0{K), so we 
can recover {Px)s accurately with 0{K) measurements using the set query sketches of [Prill j . 

Finally, we show that we can achieve the first and second result in Figure [1] by replacing the 
pyramid mapping by a variant of an even more basic transform, namely the (two-dimensional) 
Haar wavelet mapping. Our variant is obtained by rescaling the original Haar wavelet vectors 
using exponential weights, to mimic the pyramid scheme behavior. This result relates the two well- 
studied notions (EMD and wavelets) in a somewhat unexpected way. As a bonus, it also simplifies 
the algorithms, since inverting the wavelet mapping can now be done explicitly and losslessly. 

2 Preliminaries 

Notation We use [n] to denote the set {1 . . . n}. For any set S C [n], we use S to denote the 
complement of S, i.e., the set [n] \ S. For any x G M", Xi denotes the iih. coordinate of x, and xs 
denotes the vector x' G given by x[ = Xi \i i ^ S, and x[ = otherwise. We use supp(x) to 
denote the support of x. We use Rt^l to denote the set of functions from [A] x [A] to M; note that 
RI^I can be identified with R"- since n = A^. We also use R_|_ to denote {x S R | x > 0}. 
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EMD Consider any two non-negative vectors x,y ^ such that = ||y||i. Let T{x,y) be 

the set of functions 7 : [A]^ x [A]^ — )• R+, such that for any i,j £ [A]^ we have Yli lih = ^-nd 
X^/7(^j) — Vj'i that is, F is the set of possible "flows" from x to y. Then we define 

EMD*(x,y) = min 

to be the min cost flow from x to y, where the cost of an edge is its ii distance. This induces a 
norm IHI^;^/^) such that ||x — vWemd ~ EMD*(x,y). For general vectors w, 

\\M\emd= EMD*(x,2/) + D ||z||i 

x—y+z=w 

Mi=\\y\\i 

x,y>0 

where D = 2 A is the diameter of the set [A]^. That is, lltLiH^j^,^^ is the min cost flow from the 
positive coordinates of w to the negative coordinates, with some penalty for unmatched mass. 



Signal models The basic idea of the signal models framework of |BCDH10] is to restrict the 
sparsity patterns of the approximations. For some sparsity parameteiH K let Sk be a family of 
subsets of [n] such that for each S G Sk we have l^l < K. The family Sk induces a signal model 
Mk CM" where 

Mk = {x e M" I supp(x) C S for some S G Sk}- 

Note that Mk is a union of \Sk\ subspaces, each of dimension at most K. The signals in M.k are 
called Mk- sparse. 

The following two examples of signal models are particularly relevant to our paper: 

1. General fc-sparse signals, where Sk contains all A;-subsets of [n]. In this case the induced signal 
model (denoted by S^) contains all A;-sparse signals. 

2. Tree sparse signals. In this case, we assume that n = for some (constant) integer c and 
parameter /, and associate each i G [n] with a node of a full c-ary tree T{c,l) of depth /. 
The family Sk contains all sets S of size up to K that are connected in T{c,l) and contain 
the root (so each S corresponds to a graph-theoretic subtree of T{c,l)). The induced signal 
model is denoted by T^, or Tk for short o 

In order to facilitate signal recovery, one often needs to consider the differences x — y of two 
signals x G Ai, y G M' . For this purpose we define the Minkowski sum of Mk and M'j^ as 
Mk © M'j^ = {x + y : X e MK,y £ -M-'j^}. To simplify the notation, we define A^^*) to the t-wise 
Minkowski sum of Mk- For all signal models considered in this paper, we have M^l^ C Mki- 

Restricting sparsity patterns enables to recover sparse approximations from shorter sketches. 
We defer a more thorough overview of the results to Section [5l 



Assumptions We assume that the sparsity parameters k (and K, where applicable) are smaller 
than n/2. Note that if this assumption does not hold, the problem becomes trivial, since one can 
define the measurement matrix A to be equal to the identity matrix. 

®We use K to denote the sparsity in the context of model-based recovery (as opposed to k, which is used in the 
context of "standard" recovery). 

^We note that technicaUy this model was originally defined with respect to the wavelet basis (as opposed to the 
standard basis here) and for c = 2. We adapt that definition to the needs in our paper. 
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3 Framework for EMD-sparse recovery 

In this section we describe our approach to reducing sparse recovery under EMD into sparse recovery 
under ii. We need the following three components: (i) a t x n matrix B (that will be used to map 
the EMD space into the £i space); (ii) a signal model 7W C M*; and (iii) an ii/ii recovery scheme 
for Ai. The latter involves an m x t matrix A' (or a distribution over such matrices) such that, for 
any x G M*, given A'x, one can recover x* such that 

llx — < C' min x — x'\\, (4) 

for an approximation factor C . If A' is a distribution over matrices, we require that the guarantee 
holds with some constant probability, e.g., 2/3. 

The mapping B must satisfy the following three properties: 

A. (EMD-to-£i expansion.) For all x G M", 

II^IIemd — ll-^^lli • 

B. (Model-alignment of EMD with M.) For aU x G M" , there exists a y G M with 

\\y-Bx\\^ <e min ,\\x - x'\\ 

fe-sparse x' 

C. (Invertibility.) There is an efficient algorithm : M* — )■ M" such that, for some constant D 
and all y G M*, 

lly — BB~^(y)\\, < D min lly — Bx\\^ . 



Lemma 3.1. Consider B, A' , Ai satisfying the above properties. Then the matrix A = A'B supports 
k-sparse recovery for EMD ( as defined in Equation ([2]) ) with approximation factor C = (1 + D)C'e. 

Proof. Consider the recovery of any vector x G . Let 

E= min lla; — x'|L,,rn • 

fc-sparsc x' " »J^MU 

By Property [HI for any x G M", there exists a y G with 

||y — Bx\\-^ < eE. 

Hence our H-i/d-i model-based recovery scheme for A^, when run on Ax = A'Bx, returns a y* with 

\\y* -Bx\\^ < C'eE. 
Let x* =B^^{y*). We have by Property [C] that 

\\y* - Bx*\\, < D min lly* - Bx'\\, < D \\y* - Bx\\. < DC'eE. 

Hence by Property |A] 

Ik* - x\\emd ^ - x)||i < ||y* - Bx\\i + \\y* - Bx*\\;^ 

< (1 + D)C'eE 

as desired. □ 
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4 Pyramid transform 



In this section we will show that the pyramid transform P defined in Equation ([3]) of Section [T] 
satisfies properties [B] and [C] of Section [3l with appropriate parameters. 

The property [A] has been shown to hold for P in many other papers (e.g., [Cha02l |IT03| ). The 
intuition is that the weight of a cell is at least the Earth-Mover Distance to move all mass in the 
cell from the center to any corner of the cell, including the corner that is at the center of the parent 
of the cell. 



4.1 Model-alignment with tree sparsity 

In this section we show Property [HI where the signal model Ai is equal to the i^-tree-sparse model 
7k, for K = 0{k\og{n/k)). In fact, we show a stronger statement: the trees have their width (the 
maximum number of nodes per level) bounded by some parameter s. We will exploit the latter 
property later in the paper. 

Lemma 4.1. For any x £ M" there exists a tree S C [t] of size K and width s with 

\\{Px)s\\i <e min ,\\x-x'\\ 

for s = {-^k) and K = 0{^k\og{n/k)). 

Proof. Let x' = arg min^.^p^^j-gg \\x — a^'HeMD the fe-medians approximation of x. Consider the 
cells that contain each point in the support of x' . For each such cell at any level i, add the O(^) 
other cells of the same level within an (.i distance of |2*. The resulting S has s = O {^k) cells per 
level, and all the ancestors of any cell in the result also lie in S. So S" is a tree of width s. It has 
0{s) elements from the top log4 s levels, and 0{s) elements on each of the log4 1 — log4 s remaining 
levels, for a size K = 0{slogt/s). We will show that ||(Px)^||-^ is small. 

Define Cj for i G [A]^ to be the elementary vector with a 1 at position i, so Xi = x ■ Ci. Suppose 
that the distance between i and the nearest center in x' is Vi. Then we have 

||(Px)-^||^= ||(-Pxjej)-^||-|^ = II (Pej)-^ll 
11^ ~ ^ Wemd ^ X] 



ie[A] 



so it is sufficient to show ||(Pei)^||j^ < evi for any i. 

Let h be the highest level such that is not contained in a cell at level h in S. If no such h 
exists, ||(Pej)^||^ = 0. Otherwise, Vi > ^2^, or else S would contain e^'s cell in level h. But then 

h 

as desired. □ 
Corollary 4.2. For any x E M" , there exists a y £ Tk with 

\\y-Px\\^ <e min - x'\\ 

k-sparse x' m^ju-i^ 
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4.2 Invertibility 

Given an approximation b to Px, we would like to find a vector y with ||6 — Py\\i small. Note that 
this task can be formulated as a linear program, and therefore solved in time that is polynomial in 
n. In Appendix |A] we show a much faster approximate algorithm for this problem, needed for our 
fast recovery algorithm: 

Lemma 4.3. Given any approximation b to Px, we can recover a y in 0(|supp(6)|) time with 

\\Py-Px\\^ < 8||6-Px||i. 

Recall that P has t = [4n/3j rows. This means standard ii/ii ii'-sparse recovery for Px is 
possible with m = 0{K logt/K) = 0{-^klog'^{n/k)). Hence by Lemma 13. H using B = P and 
standard sparse recovery techniques on the model Ai = J^k gives the first result in Figured} 

Theorem 4.4. There exists a deterministic EMD/EMD recovery scheme with m = 0{-^k log^^n/k)) 
and C = e. Recovery takes 0{nlog'^ n) time for some constant c. 

5 Tree-sparse recovery 

To decrease the number of measurements required by our algorithm, we can use the stronger signal 
model Tk instead of S^. The paper [BCDHIO] gives an algorithm for model-based sparse recovery 
of Tk, but their theorem does not give an ii/ii guarantee. In Appendix [B] we review the prior 
work and convert their theorem into the following: 

Theorem 5.1. There exists a matrix A with 0{K) rows and a recovery algorithm that, given Ax, 
returns x* with 

11^^ — < C \/\o^ji[K) min llx — x'lL 

x'&Tk 

for some absolute constant C > 1. As long as the coefficients of x are integers bounded by n^^^\ 
the algorithm runs in time 0{K'^nlog^ n) for some constant c. 

By Lemma 13. H using this on B = P and M = Tk gives the second result in Figure [1) 

Theorem 5.2. There exists a deterministic EMD/EMD recovery scheme with m = 0{-^k\og{n/k)) 
and distortion C = 0{e^/\og{n/k)). Recovery takes 0{k'^n\og'^ n) time for some constant c. 

6 Beyond tree sparsity 

The previous section achieved 0{y/logn) distortion deterministically with 0{klog{n/k)) rows. In 
this section, we improve the distortion to an arbitrarily small constant e at the cost of making the 
algorithm randomized. To do this, we show that EMD under the pyramid transform is aligned with 
a stronger model than just tree sparsity — the model can restrict the values of the coefficients as 
well as the sparsity pattern. We then give a randomized algorithm for ii/ii recovery in this model 
with constant distortion. 

Definition 6.1. Define to be the family of sets S C [t] such that (i) S corresponds to a 
connected subset of G containing the root and (ii) \Sr\Gi\ < s for all i. We say that such an S is 
iC-tree-sparse with width s. 
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Definition 6.2. Define A4 C Tk as 



/ i supp(y) C S for some S G T^, and \ 



where s = 0{^k) comes from Lemma\4.1 



Note that every y € is non-negative, and {Px)s G for all x G M" . With Lemma l4 .11 this 
implies: 

Lemma 6.3. There is model- alignment of P with M, i.e., they satisfy Property [M 
We will give a good algorithm for li/ii recovery over M. 

6.1 Randomized £i/£i recovery of Ai 

Theorem 6.4. There is a randomized distribution overmxt matrices A withm = 0{^klog{n/k)) 
and an algorithm that recovers y* from Ay in 0{-^k\og{n/k)) time with 

\\V* - vWi < C min ||y - y'\\ 

with probability 1 — k^^^^\ for some constant C . We assume k = r2(loglogn). 

We will give an algorithm to estimate the support of y. Given a sketch of y, it recovers a support 
S G Tj? with 

1 1 Vol 1 1 < 10 min lly — w'lL • 
We can then use the set query algorithm |Prill| to recover a y* from a sketch of size OdS"!) with 

\\y* -ysWi < WusWi- 

Then 

\\y* - y\\i < \\y* - ysWi + \\y - ysWi < 2 \\ys\\i < 20 min \\y - y'\\ . 

as desired. Hence estimating the support of y is sufficient. 

6.2 Finding a good sparse support S to y 

Vectors y' €z M have two properties that allow us to find good supports S £ Tf^ with constant 
distortion using only 0(151) rows. First, supp(y') forms a tree, so the support can be estimated 
from the top down, level by level. Second, each coefficient has value at least twice the sum of the 
values of its children. This means that the cost of making a mistake in estimating the support (and 
hence losing the entire subtree below the missing coefficient) is bounded by twice the weight of the 
missing coefficient. As a result, we can bound the global error in terms of the local errors made at 
each level. 

Of course, y may not be in A^. But y is "close" to some y' G Al, so if our algorithm is "robust", 
it can recover a good support for y as well. Our algorithm is described in Algorithm 16. li 



Lemma 6.5. Algorithm \6. 1\ uses a binary sketching matrix of 0{s\og{n/ s)) rows and takes 0(slog(n/s)) 
time to recover S from the sketch. 
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Definition of sketch matrix A. The algorithm is parameterized by a width s. Let hi be a 
random hash function from Gi to 0(s) for i £ [log(n/s)]. Then define A'{i) to be the 0{s) x \Gi\ 
matrix representing hi, so A'{i)ab = 1 if hi(b) = a and otherwise. Choose A to be the vertical 
concatenation of the ^'(i)'s. 
Recovery procedure. 



procedure FindSupport(A, b) 



nog{n/s) 



G 



log(n/s) 



for i = log(n/s) — 1 ... do 

y* ^ bh^f^j) for j G G{Ti+i] 

Ti ^ argmax 
T'CC(T,+i) 
\T'\<2s 

end for 

log(n/s) 

U TiU U Gi 

i=0 «>log(n/s) 

end procedure 



> Find approximate support S to y from b = Ay 



» Klog(n/s)| < 2s 



o Estimate y over C(T'j+i). 
> Select the 2s largest elements of our estimate. 



Algorithm 6.1: Finding sparse support under Ai 



Proof. The algorithm looks at 0(log(n/s)) levels. At each level it finds the top 2s of 4 x 2s values, 
which can be done in linear time. The algorithm requires a sketch with O (log (n/s)) levels of 0(s) 
cells each. □ 

The algorithm estimates the value of yc{Ti+i) by hashing all of yc^ into an 0(s) size hash table, 
then estimating yj as the value in the corresponding hash table cell. Since y is non-negative, this is 
an overestimate. We would like to claim that the 2s largest values in our estimate approximately 
contain the s largest values in yc{Ti+i)- particular, we show that any yj we miss is either (i) not 
much larger than s of the coordinates we do output or (ii) very small relative to the coordinates 
we already missed at a previous level. 

Lemma 6.6. In Alaorithm \6.1\ for every level i let Wi = maXg^(j^Xi+i)\Ti Ug denote the maximum 
value that is skipped by the algorithm and let fi = ||2/Gi+i\Ti+i ||]^ denote the error from coordinates 
not included in Tj+i. Let Ci denote the s-th largest value in yT^- Then with probability at least 
1 — e"^*^**), Wi < max{^, 2cj} for all levels i. 

Proof. Define s' = 8s > |C(rj+i)|. We make the hash table size at each level equal to n = 32s'. 
We will show that, with high probability, there are at most s coordinates p where y* is more than 
fi/s' larger than yp. Once this is true, the result comes as follows: y* is an overestimate, so the top 
2s elements of y* contain at least s values that have been overestimated by at most fi/s'. Because 
the algorithm passes over an element of value Wi, each of these s values must actually have value 
at least Wi — fi/s'. Hence either Wi < 2/j/s' = |^ or all s values are at least Wi/2. 

To bound the number of badly overestimated coordinates, we split the noise in two components: 
the part from Gj\C(Tj_|_i) and the part from C(rj_|_i). We will show that, with probability l — e~^^'^\ 
the former is at most fi/s' in all but s/4 locations and the latter is zero in all but 3s/4 locations. 

WLOG we assume that the function hi is first fixed for Gi \C(Ti^i), then randomly chosen for 
C(rj_|_i). Let Oi C [u] be the set of "overflow buckets" I such that the sum s; = ^^p^ciTi+i) hi{p)=i Up 
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is at least fij s' . By the definition of /j, s/ = /i/2, so 

10.1/. <|^/.= 1/2^ = 1/64. 

Thus, the probabihty that a fixed child q £ C(Tj+i) is mapped to Oi is at most 1/64. This is 
independent over C{Ti^i), so the Chernofi' bound applies. Hence with probability at least l — e~^^'^\ 
the number of g G C(rj4.i) mapping to Oi is at most twice its expectation, or |C(rj_|_i)| /32 = s/4. 

We now bound the collisions within C(Tj_|_i). Note that our process falls into the "balls into 
bins" framework, but for completeness we will analyze it from first principles. 

Let Z be the number of cells in C(Ti-|_i) that collide. Z is a function of the independent random 
variables hi{p) for p G C{Ti+i), and Z changes by at most 2 if a single hi{p) changes (because p can 
cause at most one otherwise non-colliding element to collide). Hence by McDiarmid's inequality, 

Pv[Z > E[Z] +t]< e-*'/(2s') 

But we know that the chance that a specific p collides with any of the others is at most s' /u = 1/32. 
Hence E[Z] < s'/32, and 

Pr[Z>(l + 6y]<e-^^'/2. 
By setting e = 2/32 we obtain that, with probability 1 — e"^'^''^ we have that Z < ^ = 3s/4. 



Hence with probability 1 — e only 3,s/4 locations have non-zero corruption from C{Ti 



i+lj 



and we previously showed that with the same probability only s/4 locations are corrupted by f /s' 
from outside C(Tj_|_i). By the union bound, this is true for all levels with probability at least 
1 - (log n)e-^(") = 1 - e-^(") . □ 

Lemma 6.7. Let S be the result of running Alaorithm \6. 1\ on y gM}. Then 



Ivclli < 10 min \\y — y'\ 



with probability at least 1 — e 



n{s) 



Proof. From the algorithm definition, Tj = S CiGi for each level i. Let y' Ai minimize ||y — ?/'||^, 
and let U = supp(y'). By the definition oi Ai, U £ T^. 

For each i, define Vi = U (1 C(Ti+i) \ Tj to be the set of nodes in U that could have been chosen 
by the algorithm at level i but were not. For q £ U\S, define R{q) to be the highest ancestor of q 
that does not lie in S; hence R{q) lies in Vi for some level i. Then 



y-' 



qeu\s 

i p&ViR{q)=p 

i pev^ 

= 2ElkvJli' (5) 

i 

where the inequality holds because each element of y' is at least twice the sum of its children. Hence 
the sum of y' over a subtree is at most twice the value of the root of the subtree. 
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Define the error term = ||yGi+i\Ti+i Wi^ ^^'^ suppose that the statement in Lemma WM apphes. 
as happens with probability 1 — e^*^^-*. Then for any level i and p G V^, if Cj is the sth largest value 
in yTi, then yp < max{/j/4s, 2ci} or < || + 2cj. Since yj- contains at least s values larger than 
Cj, and at most \U r\Ti\ = \U Ci C(ri+i)| — \Vi\ < s — \Vi\ of them lie in U, yTi\u must contain at 
least \Vi\ values larger than Cj. This, combined with \ Vi\ < s, gives 

llwJIi </*/4 + 2||yrAC/||i- (6) 
Combining Equations ([5]) and ([6]), we get 

l|y^lli^2[^||(y-y')v;||i + ||yyjli] 

i 

<2\\{y- y')u\\^ + E WvTAuWi + fi/^) 

i 

< 2 ||(y - + 4 \ys\u\^ + hs\i 1^ 

= 2 ||(y - y')c/||i + 4 ||(y - yOswHi + \vs\x /2 

<4||y-y'||^ + ||y^||^/2. 

Therefore 

hs\x ^ + Ik^lli 

< 5 ||y - y'll^ + lly^ll^ /2 
\\vs\V<l^y-y'\\^ 

as desired. □ 



6.3 Application to EMD recovery 

By Lemma l3.1l our recovery algorithm for M gives an EMD/EMD recovery algorithm. 

Theorem 6.8. Suppose k = Q(loglogn). There is a randomized EMD/EMD recovery scheme with 
m = 0{-^klog{n/k)), C = e, and success probability 1 — k~^^^\ Recovery takes 0{^klog{n/k)) 
time. 



7 Wavelet-based method 

We can also instantiate the framework of Section [3] using a reweighted Haar wavelet basis instead 
of P for the embedding B. We will have M. be the tree-sparse model ^^(^fciogn/fc)' 
recovery scheme of Section [51 
The details are deferred to Appendix|Dj We obtain an embedding W defined by a Haar transform 
H (after rescaling the rows), and the following theorem: 

Theorem 7.1. There exists a matrix A with 0{klog{n/k)) rows such that we can recover x* from 
Ax with 

llx* — xll p».,n < C* min II VKx — -ulli < C min ||x — x'|L>f n 

y€TK ^~ k-sparsex'^^ WEMD 

for some distortion C = 0{^J\og{n/k)). 

Note that if we ignore the middle term, this gives the same EMD/EMD result as in Section [5l 
However the middle term may be small for natural images even if the right term is not. In particular, 
it is well known that images tend to be tree-sparse under H. 
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A Invertibility of Pyramid Transform 

If b were {Px)s for some S, then the problem would be fairly easy, since b tells us the mass pq in 
cells q (in particular, if q is at level i, pq = ~). Define the surplus Sq = pq — YlreC{q)Pr be the 
mass estimated in the cell that is not found in the cell's children. 

We start from the case when all surpluses are non-negative (as is the case for {Px)s)- In this 
case, we can minimize \\b — Py\\i by creating Sq mass anywhere in cell q. 

Lemma A.l. Suppose b is such that Sq > for all q G G. Let y be the result of running Algo- 
rithm HTTl on b. Then y minimizes \\b — PyWi- 
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For every cell q £ G, let Cq G M"" denote an elementary unit vector with the 1 located somewhere 
in q (for example, at the center of q). Then return 

g6G 



Algorithm A.l: Recovering y from b to minimize ||6 — Py\\i when all surpluses are non-negative. 



Proof. The vector y has the property that {Py)q > hq for all q €z G, and for the root node r we have 
{Py)r = br- Because the weights are exponential in the level value, any y' minimizing ||6 — -Py'||^ 
must have {Py')r > fen or else increasing any coordinate of y' would decrease \\b — Py'\\-^. But then 



log A 
log A 

> E E ^py\ - 

i=0 qeGi 
log A / 

= E 2-i°g^(Py' 

i=0 \ 

= (2-2-l°g^)(Py')r 

Equality holds if and only if {Py')q > bq for all g G G and {Py')r = &r- Since y has these properties, 
y minimizes \\b — Py\\i- □ 

Unfortunately, finding the exact solution is harder when some surpluses Sq may be negative. 
Then in order to minimize \\b — Py\\i one must do a careful matching up of positive and negative 
surpluses. In order to avoid this complexity, we instead find a greedy 8-approximation. We modify 
b from the top down, decreasing values of children until all the surpluses are non-negative. 




Perform a preorder traversal of G. At each node q at level i, compute the surplus Sq. If Sq is 
negative, arbitrarily decrease b among the children of g by a total of 2*~^ \ sq\, so that b remains 
non-negative. 



Algorithm A. 2: Modifying b to form all non-negative surpluses 



Lemma A. 2. Suppose we run algorithm \A.S\ on a vector b to get b' . Then 

6'|L < 3min||Py - bL . 

M Mi y 

Proof. Let y minimize \\Py — 6||^. As with Py' for any y' , Py has zero surplus at every node. 

At the point when we visit a node q, we have updated our estimate of 6 at q but not at its 
children. Therefore if q is at level i we compute Sq = ^b'q — 'l2seC{q) ^s- Then, because Py has 
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zero surplus, 



{Py)s 



s€C{q) 



< 



2i rq 



K-^\ + ^\bq-iPyU + ^ E \bs-{Py)s\- 



seCiq) 



Define fi = J^qed 1^9 ~ iPy)q\ be the original ii error on level i, and gi = J2qeG, \K ~ ^"jI ^° 
be a bound on the amount of error we add when running the algorithm. Because we only modify 
values enough to rectify the surplus of their parent, we have 



q(^G^ 



q&Gi 



seC(q) 



Unrolling the recursion, we get 



loe A— i 



gi < 



j=0 



as desired. 

This lets us prove Lemma 14.31 



□ 



Lemma 14.31 Given any approximation b to Fx, running the previous two algorithms gives a y 
with 

\\Py-Px\\^ <S\\b-Px\\^ 

in 0(|supp(6)|) time. 

Proof. By running Algorithm IA.2I on 6, we get h' with — < 3 — Then we run 
Algorithm lA. II on h' to get y that minimizes \\Py — Then 

\\Py - Px\\^ < \\Py - h'W^ + \\Px - b'W^ 
< 2 \\Px - b'W^ 
<2{\\Px-b\\, + \\b'-by 
<8\\Px-b\\^. 

To bound the recovery time, note that after Algorithm IA.2I visits a node with value 0, it sets the 
value of every descendant of that node to 0. So it can prune its descent when it first leaves supp(6), 
and run in 0(|supp(6)|) time. Furthermore, this means |supp(6')| < |supp(6)| and supp(6') is a 
top-down tree. Hence Algorithm lA.ll can iterate through the support of b' in linear time. □ 
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B Model-based compressive sensing 



In this section we first provide a quick review of model-based sparse recovery, including the relevant 
definitions, algorithms and their guarantees. We then show how to augment the algorithm so that 
it provides the guarantees that are needed for our EMD algorithms. 

B.l Background 

Model-based RIP Given a signal model A4k, we can formulate the A^/f-restricted isometry 
property (A^^'-RIP) of an m x n matrix A, which suffices for performing sparse recovery. 

Definition B.l. A matrix A satisfies the Mk-RIP with constant 5 if for any x G Mk, we have 

{l-5)\\x\\^<\\Ax\\^<{l + 6)\\x\\^ 

It is known that random Gaussian matrices with m = 0{klog{n/k)) rows satisfy the S^- 
RIP (i.e., the "standard" RIP), with very high probability, and that this bound cannot be im- 
proved jPIPWlO] . In contrast, it has been shown that in order to satisfy the Ta'-RIP, only 
m = 0{K) rows suffice [BCDHIO] . The intuitive reason behind this is that the number of rooted 
trees of size K is 2<^(^) while the number of sets of size k is (^) = 2®('''°s("/'=)). 

Algorithms Given a matrix A that satisfies the A^i^-RIP, one can show how to recover an 
approximation to a signal from its sketch. The specific theorem (proven in [ BCDHlO] and re-stated 
below) considers ^2 recovery of a "noisy" sketch Ax + e, where e is an arbitrary "noise" vector, 
while X & Mk- In the next section we will use this theorem to derive an ii result for a different 
scenario, where x is an arbitrary vector, and we are given its exact sketch Ax. 

Theorem B.2. Suppose that a matrix A satisfies M.^^ -RIP with constant 6 < 0.1. Moreover, 
assume that we are given a procedure that, given y G M", finds y* £ Aix that minimizes ||y — y*||2- 
Then there is an algorithm that, for any x £ Mr, given Ax + e, e 7^ 0, finds x* G Mr such that 

\\x - x*\\2 < C\\e\\2 

for some absolute constant C > 1. The algorithm runs in time 0((n + T + MM) log(||x||2/||e||2)), 
where T is the running time of the minimizer procedure, and MM is the time needed to perform 
the multiplication of a vector by the matrix A. 

Note that the algorithm in the theorem has a somewhat unexpected property: if the sketch is 
nearly exact, i.e., e ~ 0, then the running time of the algorithm becomes unbounded. The reason 
for this phenomenon is that the algorithm iterates to drive the error down to ||e||2, which takes 
longer when e is small. However, as long as the entries of the signals x, x* and the matrix A have 
bounded precision, e.g., are integers in the range 1, . . . ,L, one can observe that O(logL) iterations 
suffice. 

The task of minimizing ||y — ?/*||2 over y* G Mr can typically be accomplished in time polyno- 
mial in K and n. In particular, for Mr = Tr, there is a simple dynamic programming algorithm 
solving this problem in time 0{k^n). See, e.g., [ClHB09| for a streamlined description of the algo- 
rithms for (a somewhat more general) problem and references. For more mathematical treatment 
of tree approximations, see [CDDDOl] . 

The following lemma (from j NTOSj ) will help us bound the value of ||e||2. 
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Lemma B.3. Assume that the matrix A satisfies the (standard) Tjg-RIP with constant 5. Then 
for any vector z, we have ||^-2||2 < VT~+~^(lk5||2 + lk||i/-v/s), where S is the set of the s largest 
(in magnitude) coefficients of z. 

For completeness, we also include a proof. It is different, and somewhat simpler than the original 
one. Moreover, we will re-use one of the arguments later. 

Proof. We partition the coordinates of S into sets Sq,Si, S2, ■ ■ ■ ,St, such that (i) the coordinates 
in the set Sj are no larger (in magnitude) than the coordinates in the set Sj^i, j > 1, and (ii) all 
sets but St have size s. We have 

\\M\2 < 



< 



< 

< 

□ 



i=o 

t 

VT+d{\\zsj2 + Y.\\^S,\\2) 

s 

VT+d{\\zsj2 + Y.V~s{\\zs,.Ai/s)) 
VT+si\\z\\2 + \\z\\i/V^) 



B.2 New result 

We start from the following observation relating general sparsity and tree sparsity. Consider k and 
K such that K = c'klog{n/k) for some constant c'. 

Claim B.4. Assume n = ^l^r for some (constant) integer c. Then there exists a constant d such 
that Sfc C Tk- 

Proof. It suffices to show that for any S C [n] of size k there exists a rooted connected subset T of 
T(c, /) of size K such that S C T. The set T is equal to T' U T" , where (i) T' consist of all nodes 
in the tree T{c,l) up to level [log^ A:] and (ii) T" consists of all paths from the root to node i, for 
i G S. Note that |T'| = 0{k), and |r"\T'| = 0(A;(log n - log A;)) = 0(A: log(n/A:)). □ 

This claim is used in the following way. As we will see later, in order to provide the guarantee 
for recovery with respect to the model Tk, we will need to perform the recovery with respect to the 

model Tk © S^. From the claim it follows that we can instead perform the recovery with respect 

(2) 

to the model C Tk- 

Specifically, we show the following. 

Theorem B.5. Suppose that we are given a matrix and minimizer subroutine as in Theorem \B.S\ 
for Tk- Then, for any x, given the vector Ax, the approximation x* computed by the algorithm in 
Theorem \B.^ satisfies 

\\x - x*\\i < (1 + 2Cv/(l + 5)c'log(n/A:)) min ||x — x'lli 
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Proof. Let x' S Tk be the minimizer of ||x — x'\\i. Let T be a tree of size K such that x' = xt, 
and define the "£i approximation error" S = ||x — = ||x7jt||i 

Let P C T be the set of the k largest (in magnitude) coordinates of x^- By Claim [B74] it follows 
that P C T', for some T' E Tk- Let T" = T U T' . 

We decompose Ax into Axt» + Axjw = Axt" + e. Since xt" S T2K, by Theorem IB .21 we have 

\\xT"-x*\\2 < C\\e\\2 (7) 

Let r* be the support of x*. Note that |r*| < 2K. 

Since A satisfies the (standard) RIP of order k with constant 6 = 0.1, by Lemma lB.31 we have 

||e||2 < Vl + '5[||2;5||2 + lla^jxjplli /^] 



where 5 C T U P is the set of the k largest (in magnitude) coordinates of Xj^jp. By the definition 
of P, every coordinate of \xs\ is not greater than the smallest coordinate of \xp\. By the same 
argument as in the proof of Lemma lB.31 it follows that H^^Hg < II^^pH]^ /Vk, so 



We have 



e||2 < vHl+^lkrlli- (8) 



\x - = ||(x - x*)t"ut*\\i + a^*) T"UT* lll 

< \\xt" — x*!!]^ + ||2;t.\2"h||_^ + 11(3; — x*) j,„^jrp, II 
= \\xt" — x*\\^ + ||2;jv7||^ 

< \\xt" — x*\\-^^ + E 

< VaK\\xt" - x*\\2 + E 

< V^C\\e\\2 + E 



< ViKC^/{l + 5)/k \\x^\\^ + E 
= {l + 2C^/il + 5)K/k)E 



= (1 + 2C^il + 6)c' login/ k))E 
by Equations [7] and El □ 



C Strict sparse approximation 

In this section we show how to reduce the sparsity of an approximation down to k for an arbitrary 
norm || • ||. This reduction seems folklore, but we could not find an appropriate reference, so we 
include it for completeness. 

Consider a sparse approximation scheme that, given Ax, returns (not necessarily sparse) vec- 
tor X* such that — x|| < C min^.sparse x' ll^^' ~ x\\; let x' be the the minimizer of the lat- 
ter expression. Let x be the approximately best k- sparse approximation to x*, i.e., such that 
||x — < Cmiufc.sparse x" \\x" — x*\\] let x" be the minimizer of the latter expression. Note that 
since x' is /c-sparse, it follows that — < — 

Claim C.l. We have 

\\x-x\\ < [{C + 1)C + C']\\x' - x\\ 
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Proof. 



— a^ll < ||x — X* II + — x|| 

< C"||x" - 2;*|| + ||x* - x|| 

^ l|/ *llill* II 

< G ||x — X II + ||x — x|| 

< C"[||x' - x|| + ||x - x*||] + ||x* - x|| 
= (C" + l)||x* -x|| +C7'||x'-x|| 

< (C' + l)C||x'-x|| +C"||x'-x|| 
= [(C7' + l)C7 + C']||x'-x|| 

D Wavelet-based method 

We start by recalling the definition of the non-standard two-dimensional Haar wavelet basis (see |SDS95] 
for an overview). Let H £ R"^" be the matrix with rows corresponding to the basis vectors. We 
will define H in terms of the grids Gj. The first row of H has all coordinates equal to 1/n. The 
rest of H consists of three rows for each cell C € Gi for z > 1. For each cell C, the corresponding 
rows contain zeros outside of the coordinates corresponding to C. The entries corresponding to C 
are defined as follows: (i) one row has entries equal to 2~* for each entry corresponding to the left 
half of C and equal to — 2~* for each entry corresponding to the right half of C; (ii) the second row 
has entries equal to 2~* for the top half of C and to — 2~* for the bottom half; (ii) and the third 
row has entries equal to 2~* for the top left and bottom right quadrants of C, and equal to —2"* 
for the other two quadrants. 

We define W to transform into the same basis as H, but with rescaled basis vectors. In 
particular, the basis vectors from level i are smaller by a factor of 2^*~^, so the non-zero entries 
have magnitude 2'^"^*. This is equivalent to changing the coefficients of the corresponding rows of 
W to be 2*~^ rather than 2~*. Similarly, we rescale the all-positive basis vector to have coefficients 
equal to Then W = DH for some diagonal matrix D. 

This rescaling is such that the columns of W^^, call them Vi, all have ||t'i||£;j\^£) = 1. This is 
because the min-cost matching moves each of 2^*/2 coefficients by 2*/2. So we have 



< 



EMD = 

^\\{Wx)m\\EMD 



EMD 



x),;| = ||VFx||^ , 



which is Property lAl of the framework. 

Property [O is easy since W has a known inverse (namely H^D~^), giving ||y — M^M^~^2/||j^ = 
for all y. All that remains to show is Property [Bj 

Lemma D.l. For all x G M", there exists a y £ To{^k\og{n/k)) ^^^^ 

\\y-Wx\\^<e min \\x-Xk\\EMD- 

k-sparse xj^ 

Proof. We will show this using Lemma 14.11 as a black box. We know there exists a support S of 
Px corresponding to a tree of grid cells such that 

||(i^2;)sr|| <e min ||x - Xfc||^^^^ . 

fe-sparse 
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Let S' be a support of Wx that contains the all-constant basis vector as well as, for each cell C E Gi 
in S with i > 1, the three coefficients in Wx corresponding to C. Then S' is also a tree. 

For any cell C £ Gi, let u be the row in P corresponding to C and v be any of the three rows 
in W corresponding to C. Then 



klloo = 2' 



i-2 



1 

4 ll^'^Hoo 



So the only difference between v and u is that (i) v has one fourth the magnitude in each coefficient 
and (ii) some coefficients of v are negative, while all of u are positive. Hence for positive x, 
\v ■ x\ < \ \u- x\. This gives 

3 3 
< - IKPx)^!! < -e min \\x-Xk\\EMD- 

^ 4 4 B-sparse 



as desired. 

Theorem D.2. This gives 



□ 



\x* — x\\ej^j-) < C min \\Wx ■ 

V^Tk 



for some distortion G = 0{^\og{n/k)). 



I < G min \\x — x' 

k-sparse x' 



\EMD 



□ 
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